Dual-Space Analysis of the Sparse Linear Model 



David Wipf and YiWu 

Visual Computing Group, Microsoft Research Asia 

davidwipf @gmail . com, jxwuyiSgmail . com 



Abstract 

Sparse linear (or generalized linear) models combine a standard likelihood func- 
tion with a sparse prior on the unknown coefficients. These priors can conve- 
niently be expressed as a maximization over zero-mean Gaussians with different 
variance hyperparameters. Standard MAP estimation (Type I) involves maximiz- 
ing over both the hyperparameters and coefficients, while an empirical Bayesian 
alternative (Type II) first marginalizes the coefficients and then maximizes over 
the hyperparameters, leading to a tractable posterior approximation. The under- 
lying cost functions can be related via a dual-space framework from |22 |, which 
allows both the Type I or Type II objectives to be expressed in either coefficient 
or hyperparmeter space. This perspective is useful because some analyses or ex- 
tensions are more conducive to development in one space or the other Herein we 
consider the estimation of a trade-off parameter balancing sparsity and data fit. As 
this parameter is effectively a variance, natural estimators exist by assessing the 
problem in hyperparameter (variance) space, transitioning natural ideas from Type 
II to solve what is much less intuitive for Type I. In contrast, for analyses of update 
rules and sparsity properties of local and global solutions, as well as extensions to 
more general likelihood models, we can leverage coefficient-space techniques de- 
veloped for Type I and apply them to Type II. For example, this allows us to prove 
that Type Il-inspired techniques can be successful recovering sparse coefficients 
when unfavorable restricted isometry properties (RIP) lead to failure of popular 
£i reconstructions. It also facilitates the analysis of Type II when non-Gaussian 
likelihood models lead to intractable integrations. 



1 Introduction 



We begin with the likelihood model 

y^<i>x + e, (1) 
where $ e j^nxm ^ dictionary of unit ^2-norm basis vectors, x E M™ is a vector of unknown 
coefficients we would like to estimate, y E M" is the observed signal, and e is noise distributed as 
Af{e; 0, XI) (later we consider more general likelihood models). In many practical situations where 
large numbers of features are present relative to the signal dimension, the problem of estimating x 
given y becomes ill-posed. A Bayesian framework is intuitively appealing for formulating these 
types of problems because prior assumptions must be incorporated, whether explicitly or implicitly, 
to regularize the solution space. 

Recently, there has been a growing interest in models that employ sparse priors p{x) to encourage 
solutions X with mostly small or zero-valued coefficients and a few large or unrestricted values, i.e., 
we are assuming the generative a; is a sparse vector. Such solutions can be favored by using 
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with h concave and non-decreasing on [0, oo) ifTSl [161 . Virtually all sparse priors of interest can 
be expressed in this manner, including the popular Laplacian, Jeffreys, Student's t, and generaUzed 
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Gaussian distributions. Roughly speaking, the 'more concave' h, the more sparse we expect x to be. 
For example, with h{z) — z, we recover a Gaussian, which is not sparse at all, while h{z) — ^ 
gives a Laplacian distribution, with characteristic heavy tails and a sharp peak at zero. 

All sparse priors of the form dU can be conveniently framed in terms of a collection of non-negative 
latent variables or hyperparameters 7 = [71 , . . . , 7™]^ for purposes of optimization, approximation, 
and/or inference. The hyperparameters dictate the structure of the prior via 

P(^) = TT?'(^i)' P^^'i) = niaxA/'(a;i;0,7^)(^(7^), (3) 

7i>0 

i 

where is some non-negative function that is sometimes treated as a hyperprior, although it will 
not generally integrate to one. For the purpose of obtaining sparse point estimates of x, which will 
be our primary focus herein, models with latent variable sparse priors are frequently handled in one 
of two ways. First, the latent structure afforded by (O offers a very convenient means of obtaining 
(possibly local) maximum a posteriori (MAP) estimates of x by iteratively solving 

„2 



a;//) = argmin — logp(y|a;)p(a;) = arg min lly — ^ajllo + A 7 



^+log7, + /(7,) 

1% 
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where /(7i) = — 21og(p(7i) and ajj/) is commonly referred to as a Type I estimator. Examples 
include minimum £p-norm approaches H (TT] [l6l, Jeffreys prior-based methods sometimes called 
FOCUSS LZ algorithms for computing the basis pursuit (BP) or Lasso solution 1,6. ,16. ,18J , 

and iterative reweighted li methods JS). 

Secondly, instead of maximizing over both x and 7 as in (|4]l. Type II methods first integrate out 
(marginalize) the unknown x and then solve the empirical Bayesian problem [,19J 
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argmin y^E-^y + log + ^ /(7,), (5) 
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where Ej, = A/ + <i>r<i>^ and F = diag[7]. Once 7(//) is obtained, the conditional distribution 
p{x\y] 1(11)) is Gaussian, and a point estimate for x naturally emerges as the posterior mean 

=E[a;|y;7(„)] = F(,,)$^ (A/ + $r(„)$^)"' y. (6) 

Pertinent examples include sparse Bayesian learning and the relevance vector machine (RVM) |fT9l , 
automatic relevance determination (ARD) lfT4l . methods for learning overcomplete dictionaries lH, 
and large-scale experimental design ifTTl . 

While initially these two approaches may seem vastly different, both can be directly compared using 
a dual-space view 122)1 of the underlying cost functions. In brief, this involves expressing both the 
Type 1 and Type II objective solely in terms of either a; or 7 as reviewed in Section|2] The dual-space 
view is advantageous for several reasons, such as establishing connections between algorithms, de- 
veloping efficient update rules, or handling more general (non-Gaussian) likelihood functions. In 
Section[3] we utilize 7-space cost functions to develop a principled method for choosing the trade- 
off parameter A (which accompanies the Gaussian likelihood model and essentially balances sparsity 
and data fit) and demonstrate its effectiveness via simulations. Section |4] then derives a new Type 
Il-inspired algorithm in a;-space that can compute maximally sparse (minimal norm) solutions 
even with highly coherent dictionaries, proving a result for clustered dictionaries that previously has 
only been shown empirically |21|. Finally, Section |5] leverages duality to address Type II methods 
with generalized likelihood functions that previously were rendered untenable because of intractable 
integrals. In general, some tasks and analyses are easier to undertake in 7-space (Section[3]), while 
others are more transparent in a;-space (Sections |4] and |5]). Here we consider both with the goal of 
advancing the proper understanding and full utilization of the sparse linear model. 



2 Dual-Space View of the Sparse Linear Model 

Type I is based on a natural cost function in a;-space, p{x\y), while Type II involves an analogous 
function in 7-space, p{l\y)- The dual-space view defines a corresponding 7-space cost function for 
Type I and a a;-space cost function for Type II to complete the symmetry. 
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Type II in x-Space: Using the relationship 

yj:-^y = mm\\\y-^x\\l + x'^r-^x (7) 

as in 1221 . it can be shown that the Type II coefficients from (|6]l satisfy ccj//) = arg niiria; >C(//) (x), 
where 

CiH){x) ^ \\y-<fx\\l + Xg^u~,{x), (8) 

and 

gt^u^ix) ^ min ^ ^ + log + ^ /(7,). (9) 

This reformulation of Type II in cc-space is revealing for multiple reasons (Sections |4] and |5] will 
address additional reasons in detail). For many applications of the sparse linear model, the primary 
goal is simply a point estimate that exhibits some degree of sparsity, meaning many elements of x 
near zero and a few relatively large coefficients. This requires a penalty function ^(a;) that is concave 
and non-decreasing in a;^ = [xf, . . . , x^-^^ . In the context of Type I, any prior p(a;) expressible via 
will satisfy this condition by definition; such priors are said to be strongly super-Gaussian and 
will always have positive kurtosis |15|. Regarding Type II, because the associated a;-space penalty 
(|9|l is represented as a minimum of upper-bounding hyperplanes with respect to x"^ (and the slopes 
are all non-negative given 7 > 0), it must therefore be concave and non-decreasing in UJ. 

For compression, interpretability, or other practical reasons, it is sometimes desirable to have exactly 
sparse point estimates, with many (or most) elements of x equal to exactly zero. This then necessi- 
tates a penalty function g{x) that is concave and non-decreasing in \x\ = [\xi\^ . . . , |a;m|]"^, a much 
stronger condition. In the case of Type I, if log 7 + /(7) is concave and non-decreasing in 7, then 
g{x) = g{xi) satisfies this condition. The Type II analog, which emerges by further inspection 
of (|9]l stipulates that if 

log + /(7.) = log I A-i<i>^$ + r-i| + log |r| + /(7O (10) 

i i 

is a concave and non-decreasing function of 7, then g(^ii){x) will be a concave, non-decreasing 
function of \x\. For this purpose it is sufficient, but not necessary, that / be a concave and non- 
decreasing function. Note that this is a somewhat stronger criteria than Type I since the first term 
on the righthand side of ( fTOl i (which is absent from Type I) is actually convex in 7. Regardless, it is 
now very transparent how Type II may promote sparsity akin to Type I. 

The dual-space view also leads to efficient, convergent algorithms such as iterative reweighted £1 
minimization and its variants as discussed in f22l|. However, building on these ideas, we can demon- 
strate here that it also elucidates the original, widely applied update procedures developed for im- 
plementing the relevance vector machine (RVM), a popular Type II method for regression and clas- 
sification that assumes /(7) = 1 19|. In fact these updates, which were inspired by a fixed-point 
heuristic from lfT2l . have been widely used for a number of Bayesian inference tasks without any 
formal analyses or justification^ The dual-space formulation can be leveraged to show that these 
updates are in fact executing a coordinate-wise, iterative min-max procedure in search of a saddle 
point. Specifically we have the following result (all proofs are in the supplementary material): 

Theorem 1. The original RVM update rule from |fT9l Equation (16)] is equivalent to a closed-form, 
coordinate-wise optimization of 

min max 

x:-l>:0 z^O 

over X, 7, and z, where 'd[z) is the convex conjugate function |[T| of log | A/ + $diag[exp(M)]$^| 
with respect to u. 

'Although a more recent, step- wise variant of the RVM has been shown to be substantially faster f20l, 
the original version is still germane since it can easily be extended to handle more general structured sparsity 
problems. The step-wise method cannot without introducing additional approximations 1101 . 
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Type I in -/-Space: Similar methodology and the expansion of y'^'^y ^y can be used to express 
the Type 1 optimization problem in 7-space, which serves several useful purposes. Let 7(/) = 
arg min^^o ^{i) il), with 

m 

£(,)(7) - ?/^I]-iy + log|r|+^/(7,). (12) 

1=1 

Then the Type I coefficients obtained from ^ satisfy 

= r(,)$^(A/ + $r(,)$^)"'y. (13) 

Section[3]will use 7-space cost functions to derive well-motivated approaches for learning the trade- 
off parameter A. 



3 Choosing the Trade-off Parameter A 



The trade-off parameter is crucial for obtaining good estimates of x. In general, if A is too large, 
X 0; too small and x is overfitted to the noise. In practice, either expensive cross-validation or 
some heuristic procedure is often required. However, because A can be interpreted as a variance, it is 
useful to address its estimation in 7-space, in which existing unknowns (i.e., 7) are also variances. 

Learning A with Type I: Consider the Type I cost function (7). The data-dependent term can be 
shown to be a convex, non-increasing function of 7, which encourages each element to be large. The 
second term is a penalty factor that regulates the size of 7. It is here that a convenient regularizer 
for A can be incorporated. 

This can be accomplished as follows. First we expand Sj, via Sj, — X^Jli 7*^ *'/'!^ + X]j=i ^^j^J' 
where cf).i denotes the i-th column of $ and ej is a column vector of zeros with a '1' in the j-th 
location. Thus we observe that A is embedded in the data-dependent term in the exact same fashion 
as each 7^. This motivates a penalty on A with similar correspondence, leading to the objective 

m n 

£(,)(7,A) ^ y^S-iy + ^[log7, + /(7.)]+E[log^ + /(^)] 

i=i j=i 

m 

= y^I]^iy + E[log7. + /(7.)]+"logA + n/(A). (14) 

1=1 

While admittedly simple, this construction is appealing because, regardless of how each 7^ is penal- 
ized, A is penalized in a proportional manner, so both 7 and A have a properly balanced chance of 
explaining the observed data. This is important because the optimal A will be highly dependent on 
both the true noise level, and crucially, the particular sparse prior assumed p{x) (as reflected by /). 

For analysis or implementational purposes, we may convert Cmil, A) back to a;-space, with A- 
dependency now removed. It can then be shown that solving (|4|, with A fixed to the value that 
minimizes (fT4l i. is equivalent to solving 

miny^ g{xi) + ng ( ^\\u\\2] , s.t. y = ^x + u. (15) 

If a;, and it* minimize ( fTSl l. then we can demonstrate using fTJl that the corresponding A estimate, 
which also minimizes ( fT4l l. is given by A* = dh{z)/dz evaluated at z = l/n||M,||2. Note that if we 
were just performing maximum likelihood estimation of A given a;,, the optimal value would reduce 
to simply A, = l/n||M,||2, with no influence from the prior on a;. This is a fundamental weakness. 

Solving (flST l. or equivalently (fl4] i. can be accomplished using simple iterative reweighted least 
squares, or if g is concave in \xi\, an iterative reweighted second-order-cone (SOC) minimization. 

Learning A with Type II: The same procedure can be adopted for Type II yielding the cost function 

£(//)(7,A) -?/^I]-iy + log|S^|+E/(7,)+n/(A), (16) 
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where we note that, unlike in the Type I case above, the log-based term is already naturally balanced 
between A and 7 by virtue of the symmetric embedding in Ej,. It is important to stress that this 
Type II prescription for learning A is not the same as originally proposed in the literature for Type 
II models of this genre. In this context, (^5(7^) is interpreted a hyperprior on 7^, and an equivalent 
distribution is assumed on the noise variance A. Importantly, these assumptions leave out the factor 
of n in ( fTSI l, and so an asymmetry is created. 

Simulation Examples: Empirical tests help to illustrate the efficacy of this procedure. As in many 
applications of sparse reconstruction, here we are only concerned with accurately estimating x, 
whose nonzero entries may have physical significance (e.g., source localization [16], compressive 
sensing [2|, etc.), as opposed to predicting new values of y. Therefore, automatically learning the 
value of A is particularly relevant, since cross-validation is often not possible^ Simulations are 
helpful for evaluation purposes since we then have access to the true sparse generating vector 

Figure [U compares the estimation performance obtained by minimizing ( fTSl l with two different se- 
lections for g: g{x) = = J^i l^il^^ with p = 0.01 and p = 1.0. Data generation proceeds 
as follows: We create a random 100 x 50 dictionary $, with £2 -normalized, iid Gaussian columns. 
X is randomly generated with 10 unit Gaussian nonzero elements. We then compute y = ^x + e, 
where e is iid Gaussian noise producing an SNR of OdB. To determine what A values lead to optimal 
performance we solve (|4|i with the appropriate g over a range of fixed A values (lO^** to 10^) and 
then compute the error between x and x. The minimum of this curve reflects the best performance 
we can hope to achieve when learning A blindly. In Figure [T] (rop) we plot these curves for both 
Type I methods averaged over 1000 independent trials. 

Next we solve ( fT5] l, which produces an estimate of both x and A. We mark with an 'H-' the learned 
A versus the corresponding error of x. In both cases the learned A's (averaged across trials) perform 
just as well as if we knew the optimal value a priori. Results using other noise levels, problem di- 
mensions n and m, sparsity levels ||a;||o, and sparsity penalties g are similar See the supplementary 
material for more examples. 

Figure [T] (Boffom) shows the average sparsity of estimates x, as quantified by the £0 norm ||i;||o, 
across A values (||a;||o returns a count of the number of nonzero elements in x). The indicates 
the average sparsity of each x for the learned A as before. In general, the £(0.01) penalty produces 
a much sparser estimate, very near the true value of ||a;||o — 10 at the optimal A. The £1 penalty, 
which is substantially less concave/sparsity-inducing, still sets some elements to exactly zero, but 
also substantially shrinks nonzero coefficients in achieving a similar overall reconstruction error 
This highlights the importance of learning a A via a penalty that is properly matched to the prior on 
x: if we instead tried to force a particular sparsity value (in this case 10), then the £1 solution would 
be very suboptimal. Finally we note that maximum likelihood (ML) estimation of A performs very 
poorly (not shown), except in the special case where the ML estimate is equivalent to solving (O 
as occurs when /{j) = (see |6|). The proposed method can be viewed as adding a principled 
hyperprior on A, properly matched to p{x), that compensates for this shortcoming of standard ML. 

Type II A estimation has been explored elsewhere for the special case where /(7) = fl9|, which 
renders the factor of n in (fTST i irrelevant; however, for other selections we have found this factor 
to improve performance (not shown). For space considerations we have focused our attention here 
on Type I, which has frequently been noted for not lending itself well to A estimation (or related 
parameters) |6, 13 1. In fact, the symmetry afforded by the dual-space perspective reveals that Type 
I is just as natural a candidate for this task as Type II, and may be preferred in high-dimensional 
settings where computational resources are at a premium. 

4 Maximally Sparse Estimation 

With the advent of compressive sensing and other related applications, there has been growing inter- 
est in finding maximally sparse signal representations from redundant dictionaries (m ^ n) [3 , 5|. 
The canonical form of this problem involves solving 

a;o = argmin |ja;||o, s.t. y = <I>a;. (17) 



^For example, in non-stationary environments, the value of both x and A may be completely different for 
any new y, which then necessitates that we estimate both jointly. 
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A value A value 



Figure 1: Left: Normalized mean-squared error (MSB) given by (||a; — i;||2/||a;||2) (where the 
average is across 1000 trials) plotted versus A for two different Type 1 approaches. Each black '+' 
represents the estimated value of A (averaged across trials) and the associated MSE produced with 
this estimate. In both cases the estimated value achieves the lowest possible MSE (it can actually 
be slightly lower than the curve because its value is allowed to fluctuate from trial to trial). Right: 
Solution sparsity ||i;||o versus A. Even though they both lead to similar MSE, the ^(o.oi) penalty 
produces a much sparser estimate at the optimal A value. 



While ( fTTl ) is NP-hard, whenever the dictionary <i> satisfies a restricted isometry property (RIP) Q 
or a related structural assumption, meaning that each ||a;o||o columns of $ are sufficiently close 
to orthonormal (i.e., mutually uncorrelated), then replacing with ti in ( fTTb leads to a convex 
problem with an equivalent global solution. Unfortunately however, in many situations (e.g., feature 
selection, source localization) these RIP equivalence conditions are grossly violated, implying that 
the li solution may deviate substantially from Xq. 

An alternative is to instead replace ( fT7] i with minimization of ([8]) and then take the limit as A 0. 
(Note that the extension to the noisy case with A > is straightforward, but analysis is more 
difficult.) In this regime the optimization problem reduces to 

Xiji) = lim argmin g(ii\{x), s.t. y = ^x. (18) 

A— i-O X 

If log jSyl + J2i fill) is concave, then ( fTSl ) can be minimized using reweighted £i minimization. 
With initial weight vector w^^") — 1, the {k + l)-th iteration involves computing 



(fe+i) , • (fc)i I (k+i) , '^9{ii){x) 
x^^ ' arg mm > w- IxA, ' i ^—^ — 

x:y='S>x'^ 0\Xi\ 



(19) 



With /(7) = 0, iterating ( fT9] l will provably lead to an estimate of Xq that is as good or better than the 
ii solution |2T1, in particular when $ has highly correlated columns. Additionally, the assumption 
/(7) = leads to a closed-form expression for the weights toC^+i) . Let 



r]i{x;a,q) = 



(20) 



where denotes a diagonal matrix with i-th diagonal entry given by Then 

can be computed via uij^''^^-' = 7yi(ic; 0, 1/2), It remains unclear however in what circum- 
stances this type of update can lead to guaranteed improvement nor if the functions rii{x; 0, 1/2) 
are even the optimal choice. We will now demonstrate that for certain selections of a and q, we 
can guarantee that reweighted £i using rii{x; a, q) is guaranteed to recover Xq exactly if $ is drawn 
from what we call a clustered dictionary model. 

Definition 1. Clustered Dictionary Model: Let ^Ifncorr denote any dictionary such that £i mini- 
mization succeeds in solving (fTTI l for all |la;ollo ^ d. Let $iorr denote any dictionary obtained 
by replacing each column of ^imcorr with a "cluster" of rrii basis vectors such that the angle be- 
tween any two vectors within a cluster is less than some e > 0. We also define the cluster support 
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r^o C {1,2,..., to} as the set of cluster indices whereby Xq has at least one nonzero element. 
Finally, we assume that the resulting $iorr is such that every n x n submatrix is full rank. 

Theorem 2. For any sparse vector Xq and any dictionary $iorr obtained from the clustered 
dictionary model with e sufficiently small, reweighted £i minimization using weights rii{x;X,q) 
with some q > 1 and a sufficiently small will recover Xf) exactly provided that |f2o| < d, 
SieHo ™« — ^^'^ within each cluster k £ Hq the coefficients do not sum to zero. 

Theorem |2] implies that even though £i may fail to find the maximally sparse Xq because of severe 
RIP violations (high correlations between groups of dictionary columns as dictated by e lead directly 
to a poor RIP), a Type Il-inspired method can still be successful. Moreover, because whenever £i 
does succeed. Type II will always succeed as well (assuming a reweighted £i implementation), the 
converse (RIP violation leading to Type II failure but not £i failure) can never happen. Recent work 
from |21 1 has argued that Type II may be useful for addressing the sparse recovery problem with 
correlated dictionaries, and empirical evidence is provided showing vastly superior performance on 
clustered dictionaries. However, we stress that no results proving global convergence to the correct, 
maximally sparse solution have been shown before in the case of structured dictionaries (except 
in special cases with strong, unverifiable constraints on coefficient magnitudes 11211 ). Moreover, 
the proposed weighting strategy rii{x; A, q) accomplishes this without any particular tuning to the 
clustered dictionary model under consideration and thus likely holds in many other cases as well. 

5 Generalized Likelihood functions 

Type I methods naturally accommodate alternative likelihood functions. We simply must replace the 
quadratic data fit term from (|4|l with some preferred function and then coordinate-wise optimization 
may proceed provided we have an efficient means of computing a weighted ^2-norm penalized 
solution. In contrast, generalizing Type II is substantially more complicated because it is no longer 
possible to compute the marginalization Q or the posterior distribution p(a;|y; 7(//))- Therefore, to 
obtain a tractable estimate a;(//) additional heuristics are required. For example, the RVM classifier 
from [191 employs a Laplace approximation for this purpose; however, it is not clear what cost 
function is being minimized nor rigorous properties of the estimated solutions. 

Fortunately, the dual x-space view provides a natural mechanism for generalizing the basic Type II 
methodology to address alternative likelihood functions in a more principled manner. In the case 
of classification problems, we might want to replace the Gaussian likelihood p{y\x) implied by ([T]i 
with a multivariate Bernoulli distribution p{y\x) cx \og[—ijj{y, x)] where x) is the function 

(y, ^) -Yl i'^ji^)] + (1 - Vj) log [1 - • (21) 

j 

Hereyj € {0, 1} and ct^ (a;) = l/[l+exp(</)J^a;)], with </)j. denotingthe j -throw of $. This function 
may be naturally substituted into the a;-space Type II cost function ^ giving us the candidate 
penalized logistic regression function 

min -0 (y, x) + \g(ii) {x). (22) 

X 

Importantly, recasting Type II classification using a;-space in this way, with its attendant well- 
specified cost function, facilitates more concrete analyses (see below) regarding properties of global 
and local minima that were previously rendered inaccessible because of intractable integrals and 
compensatory approximations. Moreover, we retain a tight connection with the original Type II 
marginalization process as follows. 

Consider the strict upper bound on the function '4}{y, x) (obtained by a Taylor series approximation 
and a Hessian bound) given by 

il){y,x) <TT{y,x,v) = 'ip{y,v) + {v - x)'^ <^'^t + 1/8 {v - x)"^ <^'^<^ (v ~ x) , (23) 

where t = [ti, . . . ,tn]'^ with tj = yj — a-j{v). This bound holds for all v with equality 
when V = X. Using this result we obtain the lower bound on the marginal likelihood given by 
J \og[—ip{y, x)]p{x)dx > J log[— 7r(y, x, v)]p{x)dx. The dual-space framework can then be used 
to derive the following result: 
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Theorem 3. Minimization of ( |22] | with A = 4 is equivalent to solving 



max / exp[-TT{y,x,v)]\\Af{x;0,^i)ip{'yi)dxi 



(24) 



and then computing a;(//) by plugging the resulting 7 into (|6]l. 



Thus we may conclude that ( |22] | provides a principled approximation to (|5]l when a Bernoulli like- 
lihood function is used for classification purposes. In empirical tests on benchmark data sets (see 
supplementary material) using 7(7) = 0, it performs nearly identically to the original RVM (which 
also implicitly assumes f{-f) = 0), but nonetheless provides a more solid theoretical justification 
for Type II classifiers because of the underlying similarities and identical generative model. But 
while the RVM and its attendant approximations are difficult to analyze, (|22] | is relatively transpar- 
ent. Additionally, for other sparse priors, or equivalently other selections for /, we can still perform 
optimization and analyze cost functions without any conjugacy requirements on the implicit p{x). 

Theorem 4. If log | | + J2i fili) ^ concave, non-decreasing function of 7 (as will be the case 
if / is concave and non-decreasing), then every local optimum of (l24l i is achieved at a solution with 
at most n nonzero elements in 7 and therefore Xf^uy In contrast, if — \ogp{x) is convex, then (|24] | 
can be globally solved via a convex program. 

Despite the practical success of the RVM and related Bayesian techniques, and empirical evidence of 
sparse solutions, there is currently no proof that the standard variants of these classification methods 
will always produce exactly sparse estimates. Thus Theorem |4]provides some analytical validation 
of these types of classifiers. 

Finally, if we take (|22] | as our starting point, we may naturally consider modifications tailored to 
specific sparse classification tasks (that may or may not retain an explicit connection with the original 
Type II probabilistic model). For example, suppose we would like to obtain a maximally sparse 
classifier, where regularization is provided by a ||a;||o penalty. Direct optimization is combinatorial 
because of what we call the global zero attraction property: Whenever any individual coefficient Xi 
goes to zero, we are necessarily at a local minimum with respect to this coefficient because of the 
infinite slope (discontinuity) of the £0 norm at zero. However, (|22] | can be modified to approximate 
the Iq without this property as follows. 

Theorem 5. Consider the Type Il-inspired minimization problem 



which is equivalent to (|22] | with /(7) = when ai ~ a2 = A. For some ai and 02 sufficiently 
small (but not necessarily equal), the supporj^ of x will match the support of argmiria; ?/; (y, x) + 
A||a;||o. Moreover, ( l25T l does not satisfy the global zero attraction property. 

Thus Type II affords the possibility of mimicking the norm in the presence of generalized like- 
lihoods but with the advantageous potential for drastically fewer local minima. This is a direction 
for future research. Additionally, while here we have focused our attention on classification via 
logistic regression, these ideas can presumably be extended to other likelihood functions provided 
certain conditions are met. To the best of our knowledge, while already demonstrably successful 
in an empirical setting. Type II classifiers and other related Bayesian generalized likelihood models 
have never been analyzed in the context of sparse estimation as we have done in this section. 

6 Conclusion 

The dual-space view of sparse linear or generalized linear models naturally allows us to transition 
a;-space ideas originally developed for Type I and apply them to Type II, and conversely, apply 7- 
space techniques from Type II to Type I. The resulting symmetry promotes a mutual understanding 
of both methodologies and helps ensure that they are not underutilized. 

^Support refers to the index set of the nonzero elements. 



7 = arg min tj) (y, x) + ai — + log \ a2l + $r$"^ | 




(25) 
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